This tutorial is designed to cover the basics of the main MerquryRising.Rmd markdown script. Following this initial commit, the two Rmd files will be edited in parallel to make the main Rmd more functional, and this Rmd have more information. Core functions have been moved to the merquryrising.R script, which can be run as a standalone Rscript to make the basic figures and tables without the extra fluff.
NOTE: This Tutorial is a work in progress. Please refer to the main Rmd file and post questions on GitHub in the meantime.
The ultimate goal of a genome assembly is to generate a complete and correct representation of the sequenced genome. It is therefore useful to consider and assess the following attributes of a genome assembly (Fig1):
Figure 1. What makes a good genome assembly? Key genome attributes to assess during assembly QC.
Here, we are mostly considering Base Accuracy and Genome Completeness. Notably, this includes false duplications, where one region of the genome is present multiple times in the assembly. Typically, this arises through heterozygous diploid haplotypes being incorrectly identified as two different loci. At the opposite end of the spectrum, parts of the genome might be collapsed (represented by fewer copies in the assembly than the true genome) or missing entirely.
One common approach to analysing genome sequencing is to use kmers. Kmers are simply substrings of length k. For DNA, kmers are chunks of k nucleotides (Fig2). For any given DNA sequence, there will be a kmer starting at every position in the sequence except the last k-1 kmers. For long sequences, such as vertebrate genomes, the total kmer count therefore tends towards the sequence length. However, not all kmers will be unique. Where repetitive sequences exist, the same kmer sequence may appear several times.
Figure 2. Kmers. Example kmers for an invented DNA string (k=21).
In addition to breaking a genome assembly, the raw sequencing data can similarly be broken down into kmers. In this case, it is expected for kmers to appear multiple times, as each region of the genome will have been sequenced multiple times. Such data is often described or plotted as a kmer distribution (Fig3). Here, kmer counts (the number of different kmers that occur) are plotted against kmer multiplicity or coverage (the number of time the same kmer sequence is observed). This is akin to plots of numbers of bases against read depth.
Figure 3. Kmer distributions. Example GenomeScope kmer distributions from GenomeScope. Left panel, heterozygosity moves kmers from the diploid coverage peak to a haploid coverage peak at half the multiplicity level. Right panel,
Heterozygosity. For unique haploid or homozygous diploid genomes, one would expect a single dominant kmer peak, corresponding to the average number of times a part of the genome was sequenced (approx. the sequencing depth for long reads). For heterozygous diploid genomes, each SNP between haplotypes will move k kmers from the diploid peak and add 2k kmers to the haplod peak at half the coverage (Fig3, left). This is because each SNP sits in k kmers at different positions, and each will now have two different versions at approximately half the multiplicity.
Repeats. Repetitive sequences move kmers in the opposite direction. kmers in repeats will shift into the higher multiplicity section of the plot. Duplicated loci will have a peak at approx. double the diploid peak, whilst more frequent repeats can have very high kmer multiplicity values.
Merqury is a freely available tool (Fig4) for comparing the kmer distributions from genome assemblies and raw reads. One of the primary uses of Merqury is to assess the kmer completeness and base error rate (translated into a Quality Value (QV)) based on comparing the different kmers found in the raw reads with those found in the assembly (see Merqury paper and Gitub for more details). These can be found in the *.qv and *.stats output files. Whilst not the focus of MerquryRising, these files can be loaded and incorporated into the summary tables (see below). Merqury will aslo generate kmer distribution plots, which can be used to assess whether the genome assembly is accurately representing the genome itself. This is what MerquryRising is designed to assist with.
Figure 4. Merqury. Merqury is freely available on GitHub and has been published in Genome Biology.
Merqury generates similar kmer plots to GenomeScope, but with the important addition that the kmers in the raw reads are classified according to how many times they appear in the assembly (Fig5):
Figure 5. GenomeScope and Merqury plots. Kmers drawn from parts of the genome assembly with different attributes will contribute to different parts of the kmer distribution.
Merqury can also be run in a Diploid mode (Fig6), co-analysing the kmer distributions of both haplotypes. Here, the kmer counts in the assembly should match the predicted ploidy of the peaks (Fig6, left). Merqury can also plot kmers that are shared between the two haplotypes (Fig6, right).
Figure 6. Diploid Merqury plots. Note that the plot on the left superimposes the individual assembly-count distributions over each other, whereas the right plot is a stacked kmer frequency.
Whilst incredibly useful, there are some limitations with relying on the Merqury plots to assess whether a genome assembly needs purging of false duplicates, or whether changes to a genome assembly have made things better or worse. Visual assessment of plots lacks automation and reproducibility, whilst it can be challenging to compare two plots and understand what has changed and how (Fig7). Because the total number of kmers represented is the product of both axes, it can be hard to intuitively assess how much of the assembly is captured by each peak. Furthermore, the high-coverage repetitive regions are not plotted (Fig5, Fig8).
Figure 7. Comparing Merqury plots. In this example, purging has reduced the small diploid-coverage peak of kmers represented twice in the genome, which probably represent false duplicates. However, it is not clear how much this reduction equates too, nor whether it has been balanced by loss of good kmers from elsewhere in the distribution.
MerquryRising seeks to overcome these challenges in two ways. First, kmers are classified according to their multiplicity in both the raw reads and the assembly (Fig8, see details in later sections), and each kmer class quantified using both its multiplicity and kmer count.
Figure 8. MerquryRising classification. MerquryRising assigns kmers to purging classes based on their frequency in both the assembly and the raw read data (left). Quantification of kmers includes the long tail of repeat kmers (right).
Second, MerquryRising will generate pairwise comparisons between assemblies to identify how kmers have moved between classes in response to purging or gap-filling etc. (Fig9). These are explored in more detail in the next section. An example MerquryRising run can also be found in the accompanying MerquryRising.html file.
Figure 9. Comparative MerquryRising plots. MerquryRising will convert a Merqury plot into a kmer distribution difference versus a different version of the assembly (left panel) and plot kmer classes across multiple assembly versions (right panel) to help assess the effects of assembly manipulations.
This section provides additional details and background to the MerquryRising workflow.
The general MerquryRising workflow is as follows:
merquryrising.config file and (optionally) assembly labels/order from merquryrising.fofn. Establish list of input assembly files and build alias list (G1 to Gn).*.spectra-cn.hist and *.only.hist into tibbles with 3 fields: afreq (kmer frequency in assembly), rfreq (kmer frequency in reads) and knum (number of different kmers). Load *.qv and *.stats files if present.afreq and rfreq.knum by rfreq and then convert into a percentage.These steps are expanded, below.
The main input for MerquryRising is the set of *.spectra-cn.hist and corresponding .only.hist files, with optional qv/*.merqury.qv and stats/*.merqury.completeness.stats files. The latter are matched on the name field, which is parsed from the *.spectra-cn.hist and split on .merqury.. (See the example data if unclear.)
Optionally, assembly names will be mapped from the labels=FILE, otherwise the label will be the “G” number, G1 to Gn. MerquryRising will report the input files and assemblies recognised for this run in a table:
| G | label | name | hist | only | qv | completeness | base |
|---|---|---|---|---|---|---|---|
| G1 | vhi2.0.hifiasm | LutQuin_218497_OG10G.v230216hi2.0.hifiasm | TRUE | TRUE | 60.3855 | 86.3066 | LutQuin_218497_OG10G.v230216hi2.0.hifiasm.merqury.LutQuin_218497_OG10G.v230216hi2.0.hifiasm |
| G2 | vhi2.1.pafscaff | LutQuin_218497_OG10G.v230216hi2.1.pafscaff | TRUE | TRUE | 60.3855 | 86.3066 | LutQuin_218497_OG10G.v230216hi2.1.pafscaff.merqury.LutQuin_218497_OG10G.v230216hi2.1.pafscaff |
| G3 | vhi2.2.diploidocus | LutQuin_218497_OG10G.v230216hi2.2.diploidocus | TRUE | TRUE | 61.7651 | 86.1320 | LutQuin_218497_OG10G.v230216hi2.2.diploidocus.merqury.LutQuin_218497_OG10G.v230216hi2.2.diploidocus |
| G4 | vhi2.3.ragtag | LutQuin_218497_OG10G.v230216hi2.3.ragtag | TRUE | TRUE | 61.7651 | 86.1320 | LutQuin_218497_OG10G.v230216hi2.3.ragtag.merqury.LutQuin_218497_OG10G.v230216hi2.3.ragtag |
| G5 | vhi2.4.tweak | LutQuin_218497_OG10G.v230216hi2.4.tweak | TRUE | TRUE | 61.7651 | 86.1320 | LutQuin_218497_OG10G.v230216hi2.4.tweak.merqury.LutQuin_218497_OG10G.v230216hi2.4.tweak |
| G6 | vhi2.5.hypo | LutQuin_218497_OG10G.v230216hi2.5.hypo | TRUE | TRUE | 40.4451 | 86.1701 | LutQuin_218497_OG10G.v230216hi2.5.hypo.merqury.LutQuin_218497_OG10G.v230216hi2.5.hypo |
| G7 | vhi2.6.tidy2 | LutQuin_218497_OG10G.v230216hi2.6.tidy2 | TRUE | TRUE | 40.4444 | 86.1680 | LutQuin_218497_OG10G.v230216hi2.6.tidy2.merqury.LutQuin_218497_OG10G.v230216hi2.6.tidy2 |
where:
*.fofn file.*.hist file name.*.hist file was found.*.only.hist file was found.*.qv file (NA if missing).*.stats file (NA if missing).basename for that assembly.The next step is set the ploidy boundaries that establish rfreq categories as low, haploid, diploid or high frequencies. Assembly-only kmers are classified as only. Ploidy boundaries can be set for all assemblies using the boundary values loaded from the ploidy file. However, but these appear to be quite conservative and result in a lot of kmers being rated as over- or under-purged based on the merqury plots themselves, with a lot of the main distributions broader than the values loaded. Ploidy boundaries are therefore set by default to be determined from the histograms themselves. For consistent boundaries, provide a single *.hist file to the ploidy=FILE setting.
When *.hist files are used (assembly-specific or general), the diploid peak is first identified as the kmer density peak for rfreq values up to 1000. The boundaries are then calculated based on this diploid peak:
haploid frequency is set to half diploid.low boundary is set to the maximum value of 5, or 0.25 haploid. Upto this value of rfreq (raw read kmer frequency), kmers are rated low. This is plotted as a solid black line on the Comparative Merqury plots.mid boundary is set to halfway between haploid and diploid. Raw kmer frequencies above the low boundary and upto the mid boundary hap. This is plotted as a solid red line on the Comparative Merqury plots, with the haploid peak a dotted red line.high boundary is set to 1.5 diploid. Raw kmer frequencies above the hap boundary and upto the high boundary are rated dip. This is plotted as a solid blue line on the Comparative Merqury plots, with the diploid peak a dotted blue line.high boundary, kmers are rated as high.| G | haploid | diploid | low | mid | high |
|---|---|---|---|---|---|
| G1 | 28.66608 | 57.33215 | 7.166519 | 42.99911 | 85.99823 |
| G2 | 28.66608 | 57.33215 | 7.166519 | 42.99911 | 85.99823 |
| G3 | 28.66607 | 57.33215 | 7.166518 | 42.99911 | 85.99822 |
| G4 | 28.66607 | 57.33215 | 7.166518 | 42.99911 | 85.99822 |
| G5 | 28.66607 | 57.33215 | 7.166518 | 42.99911 | 85.99822 |
| G6 | 28.66609 | 57.33218 | 7.166523 | 42.99914 | 85.99827 |
| G7 | 28.66609 | 57.33218 | 7.166523 | 42.99914 | 85.99827 |
NOTE: For very heterozygous genomes, where the haploid peak exceeds the diploid peak, the diploid peak calculation will break. For these, it is advised to make and load a *.ploidy file to use for all assemblies.
NOTE: The classification percentages are ultimately defined by the ploidy boundaries, which do not fully capture the extent of each sub-distribution of kmers. As a consequence, there will always be a degree of mis-assigned kmers. For this reason, percentages are best considered as comparative values rather than absolute values.
Kmer classes are assigned based on the read frequency (rfreq) and assembly frequency (afreq), ploidy boundaries (above), and whether the assembly is haploid or diploid. This can be specified for all assemblies, else any assemblies with a .dip suffix will be treated as diploid (and everything else haploid). Classification is first based on more specific categories, which are then grouped into broader purge categories (described below).
The full list of classifications is given below, with the detailed class and higher-level purge ratings that get assigned to each afreq-rfreq combination. The dipclass and dippurge ratings are used for diploid assemblies. Ratings and definitions are described in more detail in the following sections.
| afreq | rfreq | class | purge | dipclass | dippurge |
|---|---|---|---|---|---|
| 1 | none | only | only | only | only |
| 2 | none | only | only | only | only |
| 3+ | none | only | only | only | only |
| 0 | low | noise | n/a | noise | n/a |
| 1 | low | lowQ | under | lowQ | under |
| 2 | low | lowQ | under | lowQ | under |
| 3+ | low | lowQ | under | lowQ | under |
| 0 | hap | alternate | good | missing | over |
| 1 | hap | haploid | good | haploid | good |
| 2 | hap | duplicate | under | duplicate | under |
| 3+ | hap | duplicate | under | duplicate | under |
| 0 | dip | missing | over | missing | over |
| 1 | dip | diploid | good | haploid | over |
| 2 | dip | duplicate | under | diploid | good |
| 3+ | dip | duplicate | under | duplicate | under |
| 0 | high | missing | over | missing | over |
| 1 | high | collapsed | over | collapsed | over |
| 2 | high | collapsed | over | collapsed | over |
| 3+ | high | repeats | neutral | repeats | neutral |
Following classification of each assembly, values are converted to percentages based on the raw kmer counts. This is done by multiplying the number of different kmers (knum) with a given rfreq count by that count (rfreq x knum).
NOTE: In principle, it would be appealing to use the assembly frequencies for this and report the percentage of the assembly in the different categories, rather than the percentage of raw kmers. However, because Merqury truncates assembly kmer counts at 4, this is not possible. It would also be unable to report the percentage of missing kmers.
NOTE: By definition, assembly-only (only) kmers are not found in the raw data. To add a percentage figure for this class, a set of mock raw read kmers are added using the haploid frequency (see ploidy boundaries) for kmers found once in the assembly, and the mean diploid frequency for kmers found twice. Assembly-only kmers found 3+ times are excluded. Whereas the total raw kmer count will be consistent between assemblies (when analysed with the same raw data), this only value will differ between assemblies, and thus might cause minor changes in the percentage values for other categories. (See comparative analyses, below.) To avoid this, the only kmers can be excluded entirely by setting only=FALSE in the config settings.
MerquryRising uses the following kmer classes (Fig10):
low frequency. These are most likely to be base-calling Read Errors, or contamination.low abundance in the raw data but also in the assembly indicate Low Quality regions and possible contamination.haploid frequency in the raw data but appearing more than once in the assembly are flagged as probable Duplicate regions. kmers at diploid frequency found more than once in a haploid assembly, or more than twice in a diploid assembly, are also flagged as Duplicate. These could represent scenarios where a haploid assembly contains both alleles, or where both haplotypes contain a hemizygous region that is only present in one haplotype.haploid frequency kmers that are not found in a haploid assembly are assumed to be from the Alternate Haplotype. (Where the assembly has been phased, can be confirmed by running Merqury on the combined diploid assembly of both haplotypes.)haploid frequency in the raw data but and once in the assembly are flagged as Haplotype Specific kmers (or Haploid kmers).diploid frequency in the raw data but and once in a haploid assembly, or twice in a diploid assembly, are flagged as Diploid kmers.high frequency in the raw data and three or more times in the assembly are flagged as Repeats. These are not obviously collapsed but it is not possible to assess them for under- or over-representation based on these data. (Note that rDNA repeats will normally be in this category but be collapsed.)high frequency in the raw data but only one or two times in the assembly are flagged as Collapsed repeats that are under-represented in the assembly.diploid or high frequency in the raw data but missing from the assembly are flagged as Missing. For diploid assemblies, kmers found at haploid frequency in the raw data but missing from the assembly are also flagged as Missing. Figure 10. MerquryRising Kmer Classification. MerquryRising assigns kmers to different classes based on its raw read multiplicity (
rfreq) and the number of times it is seen in the assembly (afreq), based on whether the assembly is haploid (left) or a complete diploid, i.e. both haplotypes in one file (right). Text boxes of kmer ratings are coloured according to the assembly frequency. (See also text descriptions and table.) NOTE: Both panels are the same underlying Merqury plot. In reality, the diploid case (right) should look more like the left panel of Fig6.
Following classification of each assembly, a summary table (below) and plots (Fig 11) is generated for all assemblies. This will be saved to a tab-delimited file, *.class.tsv, with the figure saved as plots/*.class.png and plots/*.class.pdf.
| assembly | only | noise | lowQ | duplicate | alternate | haploid | diploid | repeats | collapsed | missing |
|---|---|---|---|---|---|---|---|---|---|---|
| vhi2.0.hifiasm | 0.001 | 3.370 | 0.008 | 1.930 | 6.319 | 9.748 | 61.664 | 13.383 | 3.193 | 0.384 |
| vhi2.1.pafscaff | 0.001 | 3.370 | 0.008 | 1.930 | 6.319 | 9.748 | 61.664 | 13.383 | 3.193 | 0.384 |
| vhi2.2.diploidocus | 0.001 | 3.372 | 0.006 | 0.578 | 6.389 | 9.796 | 62.881 | 13.189 | 3.358 | 0.431 |
| vhi2.3.ragtag | 0.001 | 3.372 | 0.006 | 0.578 | 6.389 | 9.796 | 62.881 | 13.189 | 3.358 | 0.431 |
| vhi2.4.tweak | 0.001 | 3.372 | 0.006 | 0.578 | 6.389 | 9.796 | 62.881 | 13.189 | 3.358 | 0.431 |
| vhi2.5.hypo | 0.094 | 3.370 | 0.005 | 0.582 | 5.915 | 10.253 | 62.947 | 13.345 | 3.211 | 0.278 |
| vhi2.6.tidy2 | 0.094 | 3.370 | 0.005 | 0.577 | 5.916 | 10.252 | 62.952 | 13.343 | 3.213 | 0.278 |
Figure 11. Kmer Classification Percentages. MerquryRising kmer classes are plotted for each assembly in the order loaded or provided by histsort.
Classes are grouped into purge categories based on the observed assembly frequency (afreq) versus the expectation derived from the raw data (rfreq) (Fig12):
only)noise)lowQ, duplicate)alternate, haploid, diploid)repeats)collapsed, missing)NOTE: The only class could indicate assembly errors, or it could indicate missing kmers due to sequencing biases. As such, they are excluded from the classification, as are the noise kmers, which are suspected read errors or contamination.
Figure 12. MerquryRising Kmer Purge Ratings. Each kmer class is assigned to a purging class based on whether it occurs more, less, or as expected in the assembly based on the raw data. NOTE: Both panels are the same underlying Merqury plot. In reality, the diploid case (right) should look more like the left panel of Fig6.
A summary table of purge ratings (below) is generated for all assemblies and saved to a tab-delimited file, *.rating.tsv. If provided, the QV and Completeness scores from Merqury are added to the table. A plot of all assemblies is also generated (Fig 13) and saved as plots/*.rating.png and plots/*.rating.pdf.
| assembly | only | n/a | under | good | neutral | over | completeness | qv |
|---|---|---|---|---|---|---|---|---|
| vhi2.0.hifiasm | 0.001 | 3.370 | 1.937 | 77.731 | 13.383 | 3.577 | 86.31 | 60.39 |
| vhi2.1.pafscaff | 0.001 | 3.370 | 1.937 | 77.731 | 13.383 | 3.577 | 86.31 | 60.39 |
| vhi2.2.diploidocus | 0.001 | 3.372 | 0.584 | 79.066 | 13.189 | 3.788 | 86.13 | 61.77 |
| vhi2.3.ragtag | 0.001 | 3.372 | 0.584 | 79.066 | 13.189 | 3.788 | 86.13 | 61.77 |
| vhi2.4.tweak | 0.001 | 3.372 | 0.584 | 79.066 | 13.189 | 3.788 | 86.13 | 61.77 |
| vhi2.5.hypo | 0.094 | 3.370 | 0.587 | 79.115 | 13.345 | 3.489 | 86.17 | 40.45 |
| vhi2.6.tidy2 | 0.094 | 3.370 | 0.582 | 79.120 | 13.343 | 3.491 | 86.17 | 40.44 |
Figure 13. Purge Rating Percentages. MerquryRising purge ratings are plotted for each assembly in the order loaded or provided by histsort.
One of the main use cases for MerquryRising is to interrogate the effects of different stages in a genome assembly workflow. For example, it is useful to know whether a purging step has been effective, or has been too harsh. To achieve this, MerquryRising will also calculate the change between assembly stages and a reference assembly. This can be done in one of two modes:
Default Mode. Each assembly is compared to the previous assembly. This is useful, as in the example below, when each assembly is a different stage in an assembly workflow. The order will either be defined by the name of the files read, or by changing the histsort setting and/or the *.fofn label file. (The simplest approach is to number assemblies in the desired order.) The first assembly will be compared to itself and thus have relative values of 0.0% for all categories.
Reference Mode. Providing a specific reference to the reference=X setting will compare all assemblies to the same reference.
As with absolute values, relative purge ratings are reported as a table and figure, saved to plots/*.relrating.pdf and plots/*.relrating.png:
| assembly | only | n/a | under | good | neutral | over | completeness | qv | G |
|---|---|---|---|---|---|---|---|---|---|
| vhi2.0.hifiasm | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.00 | 0.00 | G1 |
| vhi2.1.pafscaff | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.00 | 0.00 | G2 |
| vhi2.2.diploidocus | -0.000 | 0.002 | -1.353 | 1.334 | -0.193 | 0.211 | -0.17 | 1.38 | G3 |
| vhi2.3.ragtag | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.00 | 0.00 | G4 |
| vhi2.4.tweak | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.00 | 0.00 | G5 |
| vhi2.5.hypo | 0.093 | -0.002 | 0.004 | 0.049 | 0.155 | -0.299 | 0.04 | -21.32 | G6 |
| vhi2.6.tidy2 | -0.000 | 0.000 | -0.006 | 0.005 | -0.002 | 0.002 | -0.00 | -0.00 | G7 |
Figure 14. Relative Purge Rating Percentages. MerquryRising purge ratings are plotted relative to reference assemblies for each assembly in the order loaded or provided by histsort.
In this example, we can see that the Diploidocus tidy phase has successfully reduced the underpurged kmers in the assembly and increased the good category. However, this appears to have come at a slight cost of moving some of the neutral repeat kmers into overpurged. Later, the HyPo polishing and gap-filling step restores some of the overpurged kmers, returning them to the neutral and good classes, albeit with an increase in assembly-only kmers that is also reflected by a drop in mean QV. These could represent kmers from lower-quality gap-filled regions, or kmers not captured by a different sequencing technology. Less clear from the plot, the table shows that the second Diploidocus tidy step also works to reduce the underpurged kmers very slightly in favour of good.
If more information is required, the same tabulation and plotting is done for the more detailed breakdown by kmer class:
| assembly | only | noise | lowQ | duplicate | alternate | haploid | diploid | repeats | collapsed | missing | G |
|---|---|---|---|---|---|---|---|---|---|---|---|
| vhi2.0.hifiasm | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | G1 |
| vhi2.1.pafscaff | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | G2 |
| vhi2.2.diploidocus | -0.000 | 0.002 | -0.002 | -1.352 | 0.070 | 0.047 | 1.217 | -0.193 | 0.165 | 0.046 | G3 |
| vhi2.3.ragtag | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | G4 |
| vhi2.4.tweak | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | G5 |
| vhi2.5.hypo | 0.093 | -0.002 | -0.001 | 0.005 | -0.474 | 0.457 | 0.066 | 0.155 | -0.147 | -0.153 | G6 |
| vhi2.6.tidy2 | -0.000 | 0.000 | -0.000 | -0.006 | 0.001 | -0.001 | 0.005 | -0.002 | 0.001 | 0.001 | G7 |
Figure 15. Relative kmer Class Percentages. MerquryRising kmer classes are plotted relative to reference assemblies for each assembly in the order loaded or provided by histsort.
In this example, we can see that the HyPo step had an additional impact that was not captured by the haploid purge classes. The biggest shift was a reduction in alternate and increase in haploid kmers. This could indicate a reduction of missing haplotype-specific sequences, but analysis of a combined diploid assembly of both haplotypes would be needed to confirm this.
The MerquryRising Rmarkdown implementation enables further drilling down into the differences between assemblies, by plotting the kmer distribution itself as a comparative plot between assembly versions (Figs 16-20). This gives a clearer impression of where the kmers are changing in copy number in the assembly, although note that the read-only kmers (afreq=0) may be truncated at low kmer frequency values, and changes at high frequencies will also be truncated.
NOTE: Comparative plots are only generated by the Rmarkdown implementation of MerquryRising.
For the reference assembly itself, slightly less information is returned as all the comparisons will show zero difference. Instead, just the ratings and classes are reported, along with the kmer distribution plot. (The difference plot (right panel) is still generated but has no content as there are no changes.)
| assembly | only | n/a | under | good | neutral | over | completeness | qv |
|---|---|---|---|---|---|---|---|---|
| vhi2.0.hifiasm | 0.001 | 3.370 | 1.937 | 77.731 | 13.383 | 3.577 | 86.31 | 60.39 |
| assembly | only | noise | lowQ | duplicate | alternate | haploid | diploid | repeats | collapsed | missing |
|---|---|---|---|---|---|---|---|---|---|---|
| vhi2.0.hifiasm | 0.001 | 3.370 | 0.008 | 1.930 | 6.319 | 9.748 | 61.664 | 13.383 | 3.193 | 0.384 |
Figure 16. Comparative kmer distributions for reference assembly. Lines and areas under the curve are coloured by frequency in the assembly. Left, the kmer distribtuion plot for the assembly. Right, the relative kmer distribution compared to the reference.
Non-reference assemblies generate more output, with the stats for the reference (previous assembly stage by default) and the change in rating proportions, along with the stats for the assembly stage itself. The right panel of the comparative kmer distribution plot is now informative, and shows where kmer disributions have changed. Negative regions in the plots show where there has been a reduction in a given afreq-rfreq combination, whilst the positive regions show increases.
| assembly | only | n/a | under | good | neutral | over | completeness | qv |
|---|---|---|---|---|---|---|---|---|
| vhi2.1.pafscaff | 0.001 | 3.370 | 1.937 | 77.731 | 13.383 | 3.577 | 86.31 | 60.39 |
| change | -0.000 | 0.002 | -1.353 | 1.334 | -0.193 | 0.211 | -0.17 | 1.38 |
| vhi2.2.diploidocus | 0.001 | 3.372 | 0.584 | 79.066 | 13.189 | 3.788 | 86.13 | 61.77 |
| assembly | only | noise | lowQ | duplicate | alternate | haploid | diploid | repeats | collapsed | missing |
|---|---|---|---|---|---|---|---|---|---|---|
| vhi2.1.pafscaff | 0.001 | 3.370 | 0.008 | 1.930 | 6.319 | 9.748 | 61.664 | 13.383 | 3.193 | 0.384 |
| change | -0.000 | 0.002 | -0.002 | -1.352 | 0.070 | 0.047 | 1.217 | -0.193 | 0.165 | 0.046 |
| vhi2.2.diploidocus | 0.001 | 3.372 | 0.006 | 0.578 | 6.389 | 9.796 | 62.881 | 13.189 | 3.358 | 0.431 |
Relative change of LutQuin_218497_OG10G.v230216hi2.2.diploidocus versus LutQuin_218497_OG10G.v230216hi2.1.pafscaff:
Figure 17. Comparative kmer distributions for Diploidocus purging. Lines and areas under the curve are coloured by frequency in the assembly. Left, the kmer distribtuion plot for the assembly. Right, the relative kmer distribution compared to the previous assembly stage.
From this, it can be seen that the main change is in duplicate kmers within the diploid frequency region of the kmer distribution being reduced from double-copy to single-copy in the assembly, along with some removed altogether from the haploid frequency range.
| assembly | only | n/a | under | good | neutral | over | completeness | qv |
|---|---|---|---|---|---|---|---|---|
| vhi2.5.hypo | 0.094 | 3.370 | 0.587 | 79.115 | 13.345 | 3.489 | 86.17 | 40.45 |
| change | -0.000 | 0.000 | -0.006 | 0.005 | -0.002 | 0.002 | -0.00 | -0.00 |
| vhi2.6.tidy2 | 0.094 | 3.370 | 0.582 | 79.120 | 13.343 | 3.491 | 86.17 | 40.44 |
| assembly | only | noise | lowQ | duplicate | alternate | haploid | diploid | repeats | collapsed | missing |
|---|---|---|---|---|---|---|---|---|---|---|
| vhi2.5.hypo | 0.094 | 3.370 | 0.005 | 0.582 | 5.915 | 10.253 | 62.947 | 13.345 | 3.211 | 0.278 |
| change | -0.000 | 0.000 | -0.000 | -0.006 | 0.001 | -0.001 | 0.005 | -0.002 | 0.001 | 0.001 |
| vhi2.6.tidy2 | 0.094 | 3.370 | 0.005 | 0.577 | 5.916 | 10.252 | 62.952 | 13.343 | 3.213 | 0.278 |
Relative change of LutQuin_218497_OG10G.v230216hi2.6.tidy2 versus LutQuin_218497_OG10G.v230216hi2.5.hypo:
Figure 18. Comparative kmer distributions for Diploidocus purging. Lines and areas under the curve are coloured by frequency in the assembly. Left, the kmer distribtuion plot for the assembly. Right, the relative kmer distribution compared to the previous assembly stage.
The magnitude of the final tidy step is considerably smaller, but shows a similar shift from two-copy to one-copy kmers from the diploid frequency range.
Purging steps, such as Diploidocus tidy, are explicitly trying to purge duplications. Other assembly steps can have less predictable (or desirable) outcomes. HyPo is a joint polishing and gap-filling tool. In this example, is seems to have reintroduced some kmers and removed some others.
| assembly | only | n/a | under | good | neutral | over | completeness | qv |
|---|---|---|---|---|---|---|---|---|
| vhi2.4.tweak | 0.001 | 3.372 | 0.584 | 79.066 | 13.189 | 3.788 | 86.13 | 61.77 |
| change | 0.093 | -0.002 | 0.004 | 0.049 | 0.155 | -0.299 | 0.04 | -21.32 |
| vhi2.5.hypo | 0.094 | 3.370 | 0.587 | 79.115 | 13.345 | 3.489 | 86.17 | 40.45 |
| assembly | only | noise | lowQ | duplicate | alternate | haploid | diploid | repeats | collapsed | missing |
|---|---|---|---|---|---|---|---|---|---|---|
| vhi2.4.tweak | 0.001 | 3.372 | 0.006 | 0.578 | 6.389 | 9.796 | 62.881 | 13.189 | 3.358 | 0.431 |
| change | 0.093 | -0.002 | -0.001 | 0.005 | -0.474 | 0.457 | 0.066 | 0.155 | -0.147 | -0.153 |
| vhi2.5.hypo | 0.094 | 3.370 | 0.005 | 0.582 | 5.915 | 10.253 | 62.947 | 13.345 | 3.211 | 0.278 |
Relative change of LutQuin_218497_OG10G.v230216hi2.5.hypo versus LutQuin_218497_OG10G.v230216hi2.4.tweak:
Figure 19. Comparative kmer distributions for HyPo polishing and gap-filling. Lines and areas under the curve are coloured by frequency in the assembly. Left, the kmer distribtuion plot for the assembly. Right, the relative kmer distribution compared to the previous assembly stage.
Some assembly stages are not expected to change kmers. Telociraptor, for example, trims or inverts the end of scaffolds at assembly gaps (which are not included in Merqury kmers). In these situations, the comparative kmer plots are similar to the reference assembly and show no difference.
| assembly | only | n/a | under | good | neutral | over | completeness | qv |
|---|---|---|---|---|---|---|---|---|
| vhi2.3.ragtag | 0.001 | 3.372 | 0.584 | 79.066 | 13.189 | 3.788 | 86.13 | 61.77 |
| change | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.00 | 0.00 |
| vhi2.4.tweak | 0.001 | 3.372 | 0.584 | 79.066 | 13.189 | 3.788 | 86.13 | 61.77 |
| assembly | only | noise | lowQ | duplicate | alternate | haploid | diploid | repeats | collapsed | missing |
|---|---|---|---|---|---|---|---|---|---|---|
| vhi2.3.ragtag | 0.001 | 3.372 | 0.006 | 0.578 | 6.389 | 9.796 | 62.881 | 13.189 | 3.358 | 0.431 |
| change | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| vhi2.4.tweak | 0.001 | 3.372 | 0.006 | 0.578 | 6.389 | 9.796 | 62.881 | 13.189 | 3.358 | 0.431 |
Relative change of LutQuin_218497_OG10G.v230216hi2.4.tweak versus LutQuin_218497_OG10G.v230216hi2.3.ragtag:
Figure 20. Comparative kmer distributions for Telociraptor telomere tweaking. Lines and areas under the curve are coloured by frequency in the assembly. Left, the kmer distribtuion plot for the assembly. Right, the relative kmer distribution compared to the previous assembly stage.
MerquryRising is a standalone RMarkdown file for parsing outputs from Merqury and generating some additional plots to help assess the requirements and/or consequences of duplicate purging in genome assemblies. The GitHub README will also contain some information about running the script. If you are not getting the outputs you expect, please check or update the merquryrising.config configuration file, consisting of tab-delimited pairs of setting and value.
The Merqury files needed are:
*.spectra-cn.hist files of read and assembly kmer counts.*.only.hist files of assembly-only kmer counts.*.hist.ploidy file given with ploidyfile=FILE. If this is a *.hist file, this will be used to set all the boundaries.Input files are read from the directory specified by merqurydir=PATH (default:merqury) and are set by histfiles=X (wildcards allowed, default:*.spectra-cn.hist). A matching set of *.only.hist files are also expected. Human-friendly labels can be provided with a merquryrising.fofn file (labels=FILE) that consists of label filename as plain text. filename in this instance is the full name of the *.spectra-cn.hist file, excluding the path, e.g.:
vhi2.0.hifiasm LutQuin_218497_OG10G.v230216hi2.0.hifiasm.merqury.LutQuin_218497_OG10G.v230216hi2.0.hifiasm.spectra-cn.hist
vhi2.1.pafscaff LutQuin_218497_OG10G.v230216hi2.1.pafscaff.merqury.LutQuin_218497_OG10G.v230216hi2.1.pafscaff.spectra-cn.hist
vhi2.2.diploidocus LutQuin_218497_OG10G.v230216hi2.2.diploidocus.merqury.LutQuin_218497_OG10G.v230216hi2.2.diploidocus.spectra-cn.hist
vhi2.3.ragtag LutQuin_218497_OG10G.v230216hi2.3.ragtag.merqury.LutQuin_218497_OG10G.v230216hi2.3.ragtag.spectra-cn.hist
vhi2.4.tweak LutQuin_218497_OG10G.v230216hi2.4.tweak.merqury.LutQuin_218497_OG10G.v230216hi2.4.tweak.spectra-cn.hist
vhi2.5.hypo LutQuin_218497_OG10G.v230216hi2.5.hypo.merqury.LutQuin_218497_OG10G.v230216hi2.5.hypo.spectra-cn.hist
vhi2.6.tidy2 LutQuin_218497_OG10G.v230216hi2.6.tidy2.merqury.LutQuin_218497_OG10G.v230216hi2.6.tidy2.spectra-cn.hist
MerquryRising makes use of the following libraries:
library(tidyverse)
library(ggridges)
library(grDevices)
library(GGally)
library(RColorBrewer)
library(writexl)
library(kableExtra)
library(tools)
library(patchwork)
There are two main ways to run MerquryRising. For assembly QC, it is recommended to run using the MerquryRising.Rmd file, which can then be edited with descriptions of the data to generate a QC report:
merqury outputs. (See GitHub repo for an example.)MerquryRising.Rmd and update the settings as appropriate (or edit the config file).merquryrising.fofn file to establish aliases for the assemblies and set the ordering to be used by histsort.The Merqury files needed are: * A set of *.only.hist files of assembly-only kmer counts. * A set of *.spectra-cn.hist files of read and assembly kmer counts.
To automate this, the Rmarkdown file can be knitted on the commandline:
Rscript -e 'library(rmarkdown); rmarkdown::render("MerquryRising.Rmd", "html_document")'
Alternatively, the merquryrising.R script can be run as:
Rscript merquryrising.R basefile=<PREFIX> [histfiles=<HISTFILES>] [labels=FOFN] [makexlsx=TRUE]
This will generate the main tables and kmer classification plots, but does not peform all the pairwise comparisons and plots.
The Rscript can also be incorporated into R code, as implemented within the Rmarkdown. Here, a character vector named override is setup with the commandline options, e.g.:
override <- c('fullrun=FALSE','rdir=.','debug=FALSE','dev=TRUE','makexlsx=TRUE')
NOTE: the rdir=PATH setting should point to the directory containing the MerquryRising R scripts.
Then, the main Rscript is called with source():
source('./merquryrising.R')
Running with fullrun=TRUE will run the script as if run from the commandline with all the same outputs, whilst setting fullrun=FALSE (as in the Rmarkdown) will generate all the tables that can then be output or plotted by additional function calls. (See the Rmarkdown.)
The following arguments can be set in the config file (config=FILE), the override vector, or on the commandline when running the Rscript directly:
basefile=STR : Prefix for output files (log and plots) ["merquryrising"]
boundary=X : Whether calculate boundaries or load from ploidy file if provided. ["calculate"]
config=FILE : Name of the config file ["merquryrising.config"]
debug=FALSE : Run in debugging mode with extra outputs [FALSE]
dev=FALSE : Run in development mode [FALSE]
digits=INT : Number of decimal places for tabular output [3]
diploid=INT : The diploid kmer frequency. IF 0 will calculate. If -1 will load from ploidy table if present. [0]
fullrun=TRUE : Full Rscript execution including saving outputs [TRUE]
histfiles=FILES: list.files() pattern match for input files ["*.spectra-cn.hist"]
histsort=BOOL : Optional re-ordering of input files (comma separated list). Set FALSE to match loaded file list. [TRUE]
labels=FILE : FOFN with labels mapping on to histfiles ["merquryrising.fofn"]
makexlsx=BOOL : Make TRUE to generate compiled Excel file [FALSE]
merqurydir=PATH: Directory containing merqury output files for processing ["merqury"]
only=BOOL : Whether to add extra kmers to represent the assembly-only fraction [TRUE]
outlog=FILE : Optional filename for log output. [stdout()]
pdfheight=NUM : Basic height of PDF output in inches [4]
pdfscale=NUM : Scaling factor for PDF output [1]
pdfwidth=NUM : Width of PDF output in inches [12]
ploidy=X : Ploidy of the assembly (hap/dip). If default/parse, will look for a "dip" suffix in the assembly name. ["default"]
ploidyfile=FILE: Name of the ploidy file. If default" will use the first *.ploidy file ["default"]
plotdir=FILE : Output plotting directory ["plots"]
pngheight=INT : Height of PNG output in pixels [600]
pngwidth=INT : Width of PNG output in pixels [1200]
rdir=PATH : Path to the MerquryRising R scripts [""]
reference=X : Assembly to use as reference ["default"]
rscript=TRUE : Whether the code is running as an Rscript [TRUE]
tutorial=FALSE : Run in Tutorial mode with modified functions [FALSE]
NOTE: Please check the Rscript for updated settings, which are set in the definition of the defaults list.